Enable internal text embedding API #3441

Open
ajtiwari07 wants to merge 53 commits into Azure:main from ajtiwari07:add-internal-text-embedding-system

@ajtiwari07 ajtiwari07 commented Apr 12, 2026

Summary
This PR adds configurable text chunking capabilities to the embeddings API, enabling automatic text segmentation before embedding generation. This feature supports both single-text and multi-document batch processing with runtime configuration and query parameter overrides.

Changes
Configuration
- Added EmbeddingsChunkingOptions.cs - Configuration model for chunking behavior
  - Enabled (bool) - Enable/disable chunking
  - SizeChars (int) - Chunk size in characters (default: 1000)
  - OverlapChars (int) - Overlap between chunks in characters (default: 250)
  - EffectiveSizeChars property ensures a minimum valid chunk size
- Modified EmbeddingsOptions.cs - Added Chunking property and IsChunkingEnabled helper
- Removed EmbeddingsCacheOptions.cs - Simplified configuration by removing unused cache feature
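
For illustration, the configuration model listed above could look roughly like the following. This is a Python sketch, not the PR's C# code. The field names mirror the listed properties and the defaults 1000 and 250 come from the description. The `enabled` default and the minimum of 1 character (`MIN_CHUNK_SIZE_CHARS`, a hypothetical name) are assumptions, since the description does not state them.

```python
from dataclasses import dataclass

# Assumed floor for a valid chunk size; the PR description does not give the real value.
MIN_CHUNK_SIZE_CHARS = 1


@dataclass(frozen=True)
class EmbeddingsChunkingOptions:
    """Sketch of the three chunking settings named in the PR description."""

    enabled: bool = False     # assumed default; not stated in the description
    size_chars: int = 1000    # default taken from the description
    overlap_chars: int = 250  # default taken from the description

    @property
    def effective_size_chars(self) -> int:
        # Never report a chunk size below the assumed minimum.
        return max(self.size_chars, MIN_CHUNK_SIZE_CHARS)
```

Under these assumptions, a configured size of 0 or a negative value would be treated as 1 character, while any positive size is used unchanged.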

API Enhancements
- Modified Controllers/EmbeddingController.cs
  - Auto-detects request type (single text vs. document array)
  - Implements overlapping text chunking algorithm
  - Supports query parameter overrides: $chunking.enabled, $chunking.size-chars, $chunking.overlap-chars
  - Returns multiple embeddings per document when chunking is enabled
- Added Models/EmbedDocumentRequest.cs - Request model for document arrays
- Added Models/EmbedDocumentResponse.cs - Response model with chunked embeddings
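
The overlapping chunking mentioned above can be sketched as a sliding character window. This is a minimal Python sketch under stated assumptions, not the controller's actual C# implementation: the name `chunk_text` is hypothetical, and clamping the overlap to `size - 1` (so the window always advances) is one plausible way to handle an overlap larger than the chunk size.

```python
from typing import List


def chunk_text(text: str, size_chars: int = 1000, overlap_chars: int = 250) -> List[str]:
    """Split text into fixed-size character windows that share an overlap."""
    if not text:
        return []

    size = max(size_chars, 1)                        # assumed minimum chunk size
    overlap = min(max(overlap_chars, 0), size - 1)   # assumed clamp: window must advance
    step = size - overlap

    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With the defaults of 1000 and 250, each window starts 750 characters after the previous one. Note that Python slices by Unicode code point, whereas C# strings index UTF-16 code units, so chunk boundaries can differ for characters outside the Basic Multilingual Plane.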

Schema (schemas)
- Modified dab.draft.schema.json - Added chunking configuration schema with validation rules

Testing (UnitTests)
- Added EmbeddingsChunkingOptionsTests.cs (13 tests) - Configuration validation
- Added ChunkTextTests.cs (21 tests) - Chunking algorithm validation including edge cases
- Modified EmbeddingControllerTests.cs (+18 tests) - API endpoint tests for chunking and document arrays

Total Test Coverage: 72 tests (48 existing + 24 new) - All passing

Testing
- All 72 unit tests passing
- Edge cases covered: empty text, very small chunks, overlap larger than chunk size, Unicode text
- Query parameter parsing validated
- Backward compatibility verified
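
As a rough picture of what the query parameter parsing covers, the sketch below applies the three override names from the description ($chunking.enabled, $chunking.size-chars, $chunking.overlap-chars) on top of configured values. It is a Python sketch, not the PR's code: `resolve_chunking` is a hypothetical helper, and silently ignoring malformed values is an assumption (the real endpoint may reject them instead).

```python
from urllib.parse import parse_qs


def resolve_chunking(query_string: str, enabled: bool, size_chars: int, overlap_chars: int) -> dict:
    """Return the configured chunking values with any query overrides applied."""
    params = parse_qs(query_string.lstrip("?"))

    def first(name):
        # parse_qs maps each key to a list of values; take the first one if present.
        values = params.get(name)
        return values[0].strip() if values else None

    resolved = {"enabled": enabled, "size_chars": size_chars, "overlap_chars": overlap_chars}

    raw = first("$chunking.enabled")
    if raw is not None and raw.lower() in ("true", "false"):
        resolved["enabled"] = raw.lower() == "true"

    for key, name in (("size_chars", "$chunking.size-chars"),
                      ("overlap_chars", "$chunking.overlap-chars")):
        raw = first(name)
        # Assumption: malformed numbers fall back to the configured value.
        if raw is not None and raw.isascii() and raw.isdigit():
            resolved[key] = int(raw)

    return resolved
```

For example, a request with `?$chunking.enabled=true&$chunking.size-chars=500` would turn chunking on with 500-character chunks while keeping the configured overlap.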

Breaking Changes
None - This is a backward-compatible addition. Existing single-text requests continue to work without modification.
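
One way to keep single-text requests working alongside document arrays is to branch on the shape of the request body. The Python sketch below illustrates that idea only. The body formats are assumptions, `detect_request_type` is a hypothetical helper, and the `key`/`text` field names in the usage note are invented for illustration; the PR's EmbedDocumentRequest model may differ.

```python
import json


def detect_request_type(body: str):
    """Return ("documents", list) for a JSON array body, otherwise ("text", str)."""
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        # Not JSON at all: treat the raw body as a single text to embed.
        return ("text", body)

    if isinstance(parsed, list):
        return ("documents", parsed)
    if isinstance(parsed, str):
        return ("text", parsed)
    # Any other JSON value falls back to the single-text path with the raw body.
    return ("text", body)
```

Under these assumptions, a plain body such as `hello world` stays on the single-text path, while a body such as `[{"key": "a", "text": "x"}]` is routed to the document-array path.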

Copilot AI and others added 30 commits February 3, 2026 21:06
Co-authored-by: JerryNixon <1749983+JerryNixon@users.noreply.github.com>
…configuration
… and telemetry integration
…ate empty embeddings
…oint/health sub-objects
…orization
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…dding controller tests, switching the dab schema for the embedding system to default to false
The embeddings endpoint is now permanently fixed to /embed with no
user-configurable path option.

This removes unnecessary configuration surface since the feature
has not been released yet, eliminating the need for backward
compatibility.

Changes:
- Remove path property from dab.draft.schema.json
- Remove Path, UserProvidedPath, and EffectivePath from EmbeddingsEndpointOptions
- Remove EffectiveEndpointPath from EmbeddingsOptions
- Remove path deserialization from EmbeddingsOptionsConverterFactory
- Remove --runtime.embeddings.endpoint.path CLI option
- Remove path configuration logic from ConfigGenerator
- Remove endpoint path validation from RuntimeConfigValidator
- Update Startup.cs logging to use DEFAULT_PATH constant
- Update all tests to remove path references
Contributor

@souvikghosh04 souvikghosh04 left a comment

Posting comments so far. My review is still in progress. Also, I am waiting for the existing review comments to be addressed.

Comment thread schemas/dab.draft.schema.json
Comment thread schemas/dab.draft.schema.json Outdated
Comment thread src/Cli.Tests/ConfigureOptionsTests.cs
Comment thread src/Cli.Tests/ConfigureOptionsTests.cs Outdated
Comment thread schemas/dab.draft.schema.json Outdated
Comment thread src/Service/HealthCheck/HealthCheckHelper.cs
Comment thread src/Service/HealthCheck/HealthCheckHelper.cs Outdated
Comment thread src/Service.Tests/UnitTests/ChunkTextTests.cs Outdated
Contributor

@souvikghosh04 souvikghosh04 left a comment

Posting additional comments. There are several inconsistencies between the schema JSON and the internal C# code files, including tests. For example, threshold ms, API version, and roles, to name a few, differ between the schema JSON and the internal C# files. I will wait for these to be addressed, along with the pending comments.

Comment thread src/Service.Tests/UnitTests/EmbeddingsOptionsTests.cs
Comment thread src/Config/ObjectModel/Embeddings/EmbeddingsHealthCheckConfig.cs Outdated
Comment thread schemas/dab.draft.schema.json Outdated
@ajtiwari07
Author

> Posting additional comments. There are several inconsistencies between the schema JSON and the internal C# code files, including tests. For example, threshold ms, API version, and roles, to name a few, differ between the schema JSON and the internal C# files. I will wait for these to be addressed, along with the pending comments.

I have revisited the config defaults and made them consistent across the repo.

Contributor

@souvikghosh04 souvikghosh04 left a comment

Posting additional comments. The tests still need refactoring and should not be duplicated.

Comment thread src/Core/Configurations/RuntimeConfigValidator.cs
Comment thread src/Core/Services/Embeddings/EmbeddingTelemetryHelper.cs
Comment thread src/Core/Services/Embeddings/EmbeddingService.cs
Comment thread src/Service/HealthCheck/HealthCheckHelper.cs Outdated
Comment thread src/Service.Tests/UnitTests/ChunkTextTests.cs
Comment thread src/Service.Tests/UnitTests/EmbeddingsChunkingOptionsTests.cs
Contributor

@souvikghosh04 souvikghosh04 left a comment

Approved based on the suggestions and discussions in the comments.

@souvikghosh04
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 6 pipeline(s).

@ajtiwari07
Author

/azp run

@azure-pipelines

Commenter does not have sufficient privileges for PR 3441 in repo Azure/data-api-builder
